Overview
The goal of this competition is to develop a model that detects personally identifiable information (PII) in student writing. Your efforts to automate the detection and removal of PII from educational data will lower the cost of releasing educational datasets. This will support learning science research and the development of educational tools.

Reliable automated techniques could allow researchers and industry to tap into the potential that large public educational datasets offer to support the development of effective tools and interventions for supporting teachers and students.


Description
In today’s era of abundant educational data from sources such as ed tech, online learning, and research, widespread PII is a key challenge. PII’s presence is a barrier to analyze and create open datasets that advance education because releasing the data publicly puts students at risk. To reduce these risks, it’s crucial to screen and cleanse educational data for PII before public release, which data science could streamline.
Manually reviewing the entire dataset for PII is currently the most reliable screening method, but this results in significant costs and restricts the scalability of educational datasets. While techniques for automatic PII detection that rely on named entity recognition (NER) exist, these work best for PII that share common formatting such as emails and phone numbers. PII detection systems struggle to correctly label names and distinguish between names that are sensitive (e.g., a student's name) and those that are not (e.g., a cited author).
Competition host Vanderbilt University is a private research university in Nashville, Tennessee. It offers 70 undergraduate majors and a full range of graduate and professional degrees across 10 schools and colleges, all on a beautiful campus with state-of-the-art laboratories. Vanderbilt is optimized to inspire and nurture cross-disciplinary research that fosters groundbreaking discoveries.
For this competition, Vanderbilt has partnered with The Learning Agency Lab, an Arizona-based independent nonprofit focused on developing the science of learning-based tools and programs for the social good.
Your work in creating reliable automated techniques to detect PII will lead to more high-quality public educational datasets. Researchers can then tap into the potential of this previously unavailable data to develop effective tools and interventions that benefit both teachers and students.

Evaluation
Submissions are evaluated on micro 
, which is a classification metric that assigns value to recall and precision. The value of 
 is set to 5, which means that recall is weighted 5 times more heavily than precision.

Submission File
For each document in the test set, you must predict which token values have a positive PII label. You should only include predictions of positive PII label values. Outside labels O should not be included. Each row in the submission should correspond to a single label found at a unique document-token pair. Additionally, the evaluation metric requires a row_id with an enumeration of predicted labels.

The file should contain a header and have the following format:

row_id,document,token,label
0,7,9,B-NAME_STUDENT
1,7,10,I-NAME_STUDENT
2,10,0,B-NAME_STUDENT
etc.


Prizes
Leaderboard Prizes

1st Place - $13,000
2nd Place - $10,000
3rd Place - $5,000
Efficiency Prizes

1st Place - $15,000
2nd Place - $12,000
3rd Place - $5,000
Code Requirements


Submissions to this competition must be made through Notebooks. In order for the "Submit" button to be active after a commit, the following conditions must be met:

CPU Notebook <= 9 hours run-time
GPU Notebook <= 9 hours run-time
Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models
Submission file must be named submission.csv
Please see the Code Competition FAQ for more information on how to submit. And review the code debugging doc if you are encountering submission errors.

Efficiency Prize Evaluation
Efficiency Prize
We are hosting a second track that focuses on model efficiency, because highly accurate models are often computationally heavy. Such models have a stronger carbon footprint and frequently prove difficult to utilize in real-world educational contexts. We hope to use these models to help educational organizations that have limited computational capabilities.

For the Efficiency Prize, we will evaluate submissions on both runtime and predictive performance.

To be eligible for an Efficiency Prize, a submission:

Must be among the submissions selected by a team for the Leaderboard Prize, or else among those submissions automatically selected under the conditions described in the My Submissions tab.
Must be ranked on the Private Leaderboard higher than the sample_submission.csv benchmark.
Must not have a GPU enabled. The Efficiency Prize is CPU Only.
All submissions meeting these conditions will be considered for the Efficiency Prize. A submission may be eligible for both the Leaderboard Prize and the Efficiency Prize.

An Efficiency Prize will be awarded to eligible submissions according to how they are ranked by the following evaluation metric on the private test data. See the Prizes tab for the prize awarded to each rank. More details may be posted via discussion forum updates.

Efficiency Score
We compute a submission's efficiency score by:

where 
 is the submission's score on the main competition metric, 
 is the score of the benchmark sample_submission.csv, 
 is the maximum 
 score of all submissions on the Private Leaderboard, and 
 is the number of seconds it takes for the submission to be evaluated. The objective is to minimize the efficiency score.

During the training period of the competition, you may see a leaderboard for the public test data in the following notebook, updated daily: Efficiency Leaderboard. After the competition ends, we will update this leaderboard with efficiency scores on the private data. During the training period, this leaderboard will show only the rank of each team, but not the complete score.

Acknowledgements
Vanderbilt University would like to thank The National AI Institute for Adult Learning and Online Education (AI-ALOE) for their support in making this work possible.

Citation
Langdon Holmes, Scott Crossley, Perpetual Baffour, Jules King, Lauryn Burleigh, Maggie Demkin, Ryan Holbrook, Walter Reade, and Addison Howard. The Learning Agency Lab - PII Data Detection. https://kaggle.com/competitions/pii-detection-removal-from-educational-data, 2024. Kaggle.

Dataset Description
Update

You may read more about PII Data from the following publication:

Holmes, L., Crossley, S. A., Sikka, H., & Morris, W. (2023). PIILO: An open-source system for personally identifiable information labeling and obfuscation. Information and Learning Science, 124 (9/10), 266-284. [link]
The competition dataset comprises approximately 22,000 essays written by students enrolled in a massively open online course. All of the essays were written in response to a single assignment prompt, which asked students to apply course material to a real-world problem. The goal of the competition is to annotate personally identifiable information (PII) found within the essays.

In order to protect student privacy, the original PII in the dataset has been replaced by surrogate identifiers of the same type using a partially automated process. A majority of the essays are reserved for the test set (70%), so competitors are encouraged to use external datasets that are publicly available to bolster the training data.

PII Types
The competition asks competitors to assign labels to the following seven types of PII:

NAME_STUDENT - The full or partial name of a student that is not necessarily the author of the essay. This excludes instructors, authors, and other person names.
EMAIL - A student’s email address.
USERNAME - A student's username on any platform.
ID_NUM - A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
PHONE_NUM - A phone number associated with a student.
URL_PERSONAL - A URL that might be used to identify a student.
STREET_ADDRESS - A full or partial street address that is associated with the student, such as their home address.
File and Field Information
The data is presented in JSON format, which includes a document identifier, the full text of the essay, a list of tokens, information about whitespace, and token annotations. The documents were tokenized using the SpaCy English tokenizer.

Token labels are presented in BIO (Beginning, Inner, Outer) format. The PII type is prefixed with “B-” when it is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with “I-”. Tokens that are not PII are labeled “O”.

{test|train}.json - the test and training data; the test data given on this page is for illustrative purposes only, and will be replaced during Code rerun with a hidden test set.

(int): the index of the essay
document (int): an integer ID of the essay
full_text (string): a UTF-8 representation of the essay
tokens (list)
(string): a string representation of each token
trailing_whitespace (list)
(bool): a boolean value indicating whether each token is followed by whitespace.
labels (list) [training data only]
(string): a token label in BIO format
sample_submission.csv - An example of the correct submission format. See the Submission File section of the Overview page for details.
